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WAITING FOR REGULATORY SEQUENCES TO APPEAR 

By Richard Durrett^ and Deena Schmidt^ 

Cornell University 

One possible explanation for the substantial organismal differ- 
ences between humans and chimpanzees is that there have been 
changes in gene regulation. Given what is known about transcription 
factor binding sites, this motivates the following probability question: 
given a 1000 nucleotide region in our genome, how long does it take 
for a specified six to nine letter word to appear in that region in some 
individual? Stone and Wray [Mol. Biol. Evol. 18 (2001) 1764-1770] 
computed 5,950 years as the answer for six letter words. Here, we will 
show that for words of length 6, the average waiting time is 100,000 
years, while for words of length 8, the waiting time has mean 375,000 
years when there is a 7 out of 8 letter match in the population con- 
sensus sequence (an event of probability roughly 5/16) and has mean 
650 million years when there is not. Fortunately, in biological reality, 
the match to the target word does not have to be perfect for binding 
to occur. If we model this by saying that a 7 out of 8 letter match is 
good enough, the mean reduces to about 60,000 years. 

1. Introduction. At a genetic level humans and chimpanzees are closely 
related, with 98.7% of their DNA identical. It has long been speculated that 
many of the obvious differences between the two species are due to changes in 
regulatory sequences that control how genes are expressed (King and Wilson 
[15]). A regulatory sequence is a short sequence of DNA (in vertebrates many 
are 6-9 nucleotides long) which is a binding site for transcription factors 
that promote or inhibit transcription of the DNA to make proteins. It does 
not take an advanced knowledge of biology to appreciate the importance 
of gene regulation. All of the cells in our body have the same 3 billion 
nucleotide instruction set, but each of the tissues in our body requires its 
own specialized set of proteins. 
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Comparison of experimentally verified promoters in 51 genes by Dermitza- 
kis and Clark [7], showed that roughly 1/3 of sites functional in humans are 
not functional in rodents. This shows that in the 90 million years since the 
divergence of humans and rodents, there has been a significant evolution of 
transcription factor binding sites. However, there are only a few documented 
cases of human specific regulatory evolution since our divergence from the 
other primate; see [14] and [19] for recent examples and references to earlier 
work. 

These observations suggest the question: is the evolution of regulatory 
sequences sufficiently rapid to contribute to the differences between humans 
and chimpanzees? To begin to turn this into a probability problem, we note 
that regulatory sequences occur in the upstream region of a gene (i.e., be- 
fore the gene if we are reading the DNA strand in the order in which it is 
transcribed). They are typically within 1 kb (kilobase = 1000 nucleotides) of 
the beginning of the gene. Our probability question then is: given a word of 
length W from the DNA alphabet {A, C, G, T} and a 1 kb region, how long 
do we have to wait for mutation to create this word in a given 1 kb region in 
the genome of some human? In formulating the question, we are assuming 
that once created, the new mutation will confer a substantial benefit and, 
hence, will with high probability spread rapidly through the population. We 
will return to this point in Section 2.5. 

Stone and Wray [20] studied this problem by simulation. The numbers 
they give in the first column of page 1767 are not consistent with the values 
in their Table 1 for a 2 kb region. Working backward and rounding to simplify 
the arithmetic, we infer that they found that, for words of length W = 6, the 
waiting time for the appearance of a word in a 2 kb region of one individual 
required an average of 952 mutations or 0.476 mutations per nucleotide. 
They use /i = 10~^ as an estimate of the mutation rate per nucleotide per 
generation, so this translates into 476 million generations. To convert this 
into an estimate for the human population, they assumed that all individuals 
evolve independently and divided by 10^ individuals times 2 DNA strands 
(since humans are diploid) to arrive at an estimate of 238 generations. Using 
25 years for the generation time of humans, they concluded that the desired 
new sequence would appear in an average of 5,950 years. 

As MacArthur and Brookfield [17] have already pointed out, there are 
two problems with this computation. The first is that humans are closely 
related so the evolution of the DNA sequences of different individuals is not 
independent. Indeed, two randomly chosen individuals differ in 1 of 1000 
nucleotides of their DNA. The second problem is that 10^ is a substantial 
overestimate for the "effective size" of the human population. To explain 
the last term, we note that in a homogeneously mixing population of N 
diploid individuals, the genealogy of two copies of a nucleotide coalesces 
in each generation with probability 1/2N and each nucleotide experiences 



REGULATORY SEQUENCE EVOLUTION 



3 



mutations with probability 2/x. Thus, the probabihty of a mutation before 
coalescence, which will make the two individuals different, is 

l/2N + 2fi ~ 1 + ANfi' 

Using fi = 2.5 X 10"*^ for the mutation rate to simplify the arithmetic, setting 
the fraction equal to 1/1000, approximating the denominator by 1, and 
solving gives = 10^, a number that is commonly used for the effective 
size of the human population. Using 10^ instead of 10^ in Stone and Wray's 
computation increases their estimate to 595,000 years. However, the far more 
substantial problem is the lack of independence. This further increases the 
time and brings up the question: is the evolution of regulatory sequences 
rapid enough to make a substantial contribution in the 6 million years since 
the divergence of humans and chimpanzees? 

We build up to our solution (case 4 below) by considering the waiting 
time to find a prespecified W letter DNA word in the following steps: 

1. W nucleotides in one DNA sequence. 

2. A segment of L nucleotides in one DNA sequence. 

3. W nucleotides in a population of N diploid individuals. 

4. A segment of L nucleotides in a population of diploid individuals. 

2. Results. In this section we will describe our results that give approxi- 
mate answers for the four problems and their implications for biology. Before 
entering into the details, we recall some of Aldous' [1] thoughts on the phi- 
losophy of approximations, heuristics and limit theorems: 

"The proper business of probabilists is computing probabilities. Often exact 
calculations are tedious or impossible, so we resort to approximations. A limit 
theorem is an assertion of the form 'the error in a certain approximation tends 
to as (say) N ^ oo.' Call such a limit theorem naive if there is no explicit 
bound in terms of N and the parameters of the underlying process. Such theo- 
rems are so prevalent in theoretical and applied probability that people seldom 
stop to ask their purpose. Given a serious applied problem involving specific 
parameters, the natural first steps are to seek rough analytic approximations 
and to run computer simulations; the next step is to do careful numerical 
analysis. It is hard to give any argument for the relevance of a proof of a 
naive limit theorem, except as a vague reassurance that your approximation 
is sensible, and a good heuristic argument seems equally reassuring." 

Throughout we will concentrate on DNA words of length W = 6 and W = 
8, which are sizes appropriate for human transcription factor binding sites 
and have different qualitative behavior. For the first and simplest problem, 
we are able to clearly quantify the errors in our approximation. For the 
second problem, a result of Arratia, Goldstein and Gordon [3] bounds the 
error in an associated Poisson approximation. However, to translate this into 
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a result about waiting times, and to study the more complex third and fourth 
problems, the best we have been able to do is to suggest approximations and 
explain why they should be accurate. Because of this, we label our results 
as approximations rather than theorems. The main problem is that since we 
are interested in small values of W, it is not sensible to investigate the limit 
W ^ oo. Perhaps others can succeed in finding more precise estimates of our 
errors. However, as we will see later, even our current level of understanding 
allows us to make important qualitative conclusions about the tempo of 
evolution of regulatory sequences. These have been mentioned in the abstract 
and will be explained in more detail in Section 2.5. 

2.1. W nucleotides in one DNA sequence. The simplest situation occurs 
when we consider waiting for a target W letter word to appear at a specified 
W nucleotides in one DNA sequence. To remove the waiting times between 
mutations, we consider a discrete time Markov chain X„ that changes each 
time there is a mutation and gives the number of letters that match the W 
letter target sequence. In biological reality not all mutations occur with equal 
probability, but for mathematical simplicity, we will suppose that they do. 
This simplification should not affect the order of magnitude of the estimate 
of our waiting time. has state space S = {0,1,..., V7} and transition 
probabilities 

p{x, X — 1) = x/W^ 

(1) p{x,x + l) = {l/2>){W -x)/W, 

p{x,x) = {2/2,){W -x)/W, 

where all other p{x,y) = 0. To explain this, we note that the state decreases 
by one if we choose a matching letter to mutate. If we choose a nonmatching 
letter, the state increases by one if and only if we get the right one of the 
three possible mutations. In equilibrium, letters are random so the stationary 
distribution vr for Xn is Binomial(iy^, 1/4). 

To understand the distribution of T\y = inf{t : Xt = W}, we use Proposi- 
tion 23 in Chapter 3 of [2] which implies that if we consider the continuous 
time chain Xt that jumps at rate 1, and A is any set, the hitting time 
Ta = inf{t : Xt E A} satisfies 

(2) sup \P^{Ta >t)- cxp{-t/E^TA)\ < T2/E^Ta, 

t 

where T2 is the relaxation time, which Aldous and Fill define (see [2], page 19) 
to be 1 over the spectral gap. The continuous time chain Xt has T2 = 2>W/A 
(see the Appendix for details). As we will see in a moment, E^^Tw > 4^, so 
the error is <0.0011 when W = 6 and <0.0001 when W = 8. 

To compute the mean of Tw, we first consider starting the chain at the 
target sequence, that is, Pw{Xq = W) = 1. Let T+ = inf{n > 1 : Xn = x}. A 
classic result of Kac (see, e.g.. Theorem (3.3) in Chapter 6 of [9]) states: 
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W = 6 


W = 8 


EttTw 


4,420 


69,088 




4,431 


69,104 


4^/(1 -a) 


4,456 


69,152 



Theorem 1. If Xn is an irreducible discrete time Markov chain on a 
finite state space with stationary distribution vr, then E^T^ = l/7r(x). 

In our case, EwT^ = l/-KiW) = 4^. To compute EqTw, we let a = 
Pwi^w ^ -^o) = Pw-i{Tw < ^o)- III the Appendix we show that 

W 6 8 
a 0.08093 0.05228. 

Assuming that return times with Ty^^ < Tq are small and dropping them 
from the expected value, we have 

Ew^w ~ (1 ~ a)EQTiY 

or EqT\y ~ 4^/(1 — a). Here and in what follows ~ is read "is approxi- 
mately" and has no precise mathematical meaning. 

This is also the answer that comes from Aldous' Poisson clumping heuris- 
tic [1]. "Sparse random sets often resemble i.i.d. clumps centered at points 
of a Poisson process." Here the clumps consist of the returns to W that 
come soon after the first one, which we make precise as returns to W before 
hitting 0. Once the chain hits W, then it will visit W for a geometric number 
of times with mean 1/(1 — a) before it hits 0, so to get the correct mean, we 
should multiply l/7r{W) by the expected clump size EC = 1/(1 — a). 

Since Pt^{Tw < To) is small, we expect that Ej^Tw ^ EoTw ~ 4^/(1 — a). 
As Table 1 shows, the error in the approximation is less than 1% when 
W = 6 and less than 0.1% when 1^ = 8. 

The derivation of the numbers in the table can be found in the Appendix, 
where we compute other quantities that we need associated with the muta- 
tion chain. 

Approximation 1 . For the continuous time mutation chain that jumps 
at rate 1, under P,^, T\y is approximately exponential with mean 4^/(1 — a). 

To be precise, (2) shows that the distribution function P^(Tvy > t) is 
uniformly within 3W/4^^^ of the exponential with mean Et^T\y, and our 
numerical computations show E^^Tiy differs from 4^/(1 — a) by less than 
1%. 
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2.2. A segment in one DNA sequence. We next consider waiting for a 
target W letter word to appear somewhere in a sequence of L nucleotides in 
one DNA sequence. To be precise, we have a Markov chain Xt G {A, C, G, T}^ 
that jumps at rate L, and when it jumps a randomly chosen letter is changed 
to a different randomly chosen letter. 

While L is general, we are thinking about L as being 1 or 2 thousand 
nucleotides. To avoid problems with the ends of the region, we will make 
the nonbiological assumption that our DNA region is a circle. It follows from 
symmetry that if we start from a random initial configuration, then when 
the word first appears, the probability it occurs in the circle but not in the 
flat DNA is {W — 1)/L, which is small for the values of W and L that we 
consider. 

Let A be the event that the target word is found somewhere in the se- 
quence of L nucleotides (with wrap around). This time (2) does not work 
well, since if each site jumps at rate 1, T2 = 3/4 and as we will see, E^^Ta ~ 

/{WL). Thus, the error bound is (3W^L)/4^+i. When L = 1024 = 4^ this 

is 

18/16 for = 6 and 6/64 for = 8. 

As we will see in a minute, there is a good reason for a bad bound when 
W = Q. The waiting time starting from a random initial condition has an 
atom at of size ~l/4, so the distribution cannot be well approximated by 
an exponential. It is for this reason we approach the waiting time problem 
using a version of the Chen-Stein method described in [3]. 

Let I = {0,1,2,..., L - 1}. For a e I, let A„ = {a, a + 1, . . . , a + - 
IjmodL be the W letter strip starting at index a. Fix a time T > and let 
Ya be 1 if the target word appears in Aa by time T, and otherwise. Note 
that the definition of Ya depends on T even though we have not recorded this 
dependence in the notation. Let = {a — iW — 1), . . . ,a + iW — l)}modL. 
If /3 € I\Ba, then Ya and Yg are independent, so we call Ba the neighborhood 
of dependence of a. 

Let Pa = PiYa = 1), let V = X^ae/^a be the number of occurrences of 
the target word by time T and A = EV be the expected value. Let Z be 
a Poisson random variable with EZ = A. We want to estimate the total 
variation distance d^YiV, Z) = sup^ A) — P{Z £ A)\. Define 

bi = ^ P^PP^ 
(3) 62 = E E Ei^-Yf,], 

QG//3gBc\{a} 
b3 = J2E\E[Ya-pa\Y(,:(3^Ba]\. 
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In our case, Ya is independent of Yg, f3 ^ B^, so 63 = and Theorem 1 in [3] 
simplifies to 

(4) dTv{V,Z)<2{bi + b2) 
The last factor is always <1, but can be very helpful for large A. 

a. Success in the initial condition. If we put down a random sequence 
of L nucleotides, then the expected number of times the target word is 
present in a random initial configuration is 7 = Taking L = 1024 = 4^ 
for simplicity, 

W 6 8 

-f = L/A^ 0.25 0.015625 
l-e-T 0.22119 0.015504. 

When W = 8, = 0.015625 gives an upper bound on the probability 

of a match in the initial condition, so we will ignore this possibility in our 
approximation. We will now use the Arratia, Goldstein and Gordon [3] result 
stated in (4) to argue for the following: 

Approximation 2a. When W = 6 and L = 1024 if we exclude 52 repet- 
itive words, then the number of matches in the initial condition is approxi- 
mately Poisson with mean 1/4, with an error of at most 0.0063 in the total 
variation distance. 

When T = 0, = 4"^ so 

(5) bi = L{2W -1)pI = ^{2W , 

which is 0.002687 when W = 6. As we will see in Section 3, the bound on 
62 depends on the values of k for which the word and its shift by k letters 
agree exactly on the overlap. In Table 2 we have given results for the possible 
values of k and an example pattern for each case. 

Ignoring the first three categories, which account for 48 of the 4092 non- 
constant words of length 6, the maximum value of 2(6i + 62) is 0.024907 = 
0.006225. 

b. Number of occurrences by time T. Our next task is to use the Arratia, 
Goldstein and Gordon [3] result to find a Poisson approximation for the 
number of occurrences of the target word in a segment of length L by time 
T. To evaluate the bound in (4), we note that, as in (5), 

(6) bi = L{2W- 1)pI = {2W -1)/L, 
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k 


Pattern 


62/7 


2,4 


ACACAC 


0.13281 


3, 5 


ACAACA 


0.03320 


3 


ACGACG 


0.03125 


4, 5 


AACGAA 


0.00977 


4 


ACGTAC 


0.00781 


5 


ACGTCA 


0.00193 


none 


ACGTAG 






Table 3 





tv 




tv 


AACCGT 


0.134229 


ACAGCTGT 


0.070616 


ACGGTA 


0.142948 


ACAAGGGC 


0.075011 


ACAGCA 


0.183293 


ACAGACAG 


0.100627 


AACGAA 


0.229230 


AAAAAACA 


0.145996 


ACAACA 


0.302622 


AACAACAA 


0.163626 


ACACAC 


0.465964 


ACACACAC 


0.337132 



since A = Lpa ■ Again the estimation of 62 is more complicated than that of 
bi and depends on how the word overlaps with its shifts (see Section 3.2 for 
more details). Suppose, for concreteness, that L = 1024, W = G and A = 1. 
In each case 61 = 0.010742. We have computed the bounds on the total vari- 
ation distance (tv) for all 4092 nonconstant words of length 6 and selected 
six words to illustrate the range of estimates of the total variation bound 
from the best AACCGT to the worst ACACAC. We have also done the 
computation for the 65,532 nonconstant words of length 8. In each case 
bi = 0.014648. This time the word ACACCTGT achieves the best value, 
while ACACACAC is again the worst (see Table 3). 

Even our best result is not very comforting for someone who wants to use 
the result to estimate probabilities. To improve the quality of our approxima- 
tion, we remember the Poisson clumping heuristic. We define new indicator 
variables Ya that only count a hit in strip Aa if there has been no hit in any 
overlapping strip Ap for f3 G Ba since the last time the number of matches 
in the strip A^ was 0. If two or more hits occur simultaneously, we pick one 
of the a at random to have Ya = 1, and ignore the others. This eliminates 
any beneficial effect from a hit in one strip on hits in overlapping strips, so 
the two indicator variables are negatively correlated, E(YaYp) <papp-, and 

62 < L{2W - 2)pI = X^{2W - 2)1 L. 
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Combining this with the bound for bi in (6), we have 

Theorem 2. Let pa = -P(^a = 1) and A = Lpa- Let V = J2aei be the 
number of occurrences under the new counting scheme and Z be a Poisson 
random variable with EZ = A, then 

.,„(F,Z)<H(li^A(l-.-^), 

To compare with the results in the table above, when L = 1024 and A = 1, 
the bound is 

0.02593 for = 6 and 0.03580 for W = 8. 

The good news about Theorem 2 is that the bound no longer depends on 
the word, and in case W = 6 is a dramatic improvement of the best case 
of the previous bound. The bad news is that it is difficult to analytically 
compute A. Based on results of Aldous [1], we guess that A = X/EC, where 
EC is the expected clump size. In Section 3.3 we derive an approximation 
for the clump size EC (where the hat indicates it is an approximation) by 
starting with the target word in the strip Aa for some a I, random letters 
outside and computing the expected number of occurrences in overlapping 
strips Af^ for (3 € before the number of letters matching the target word 
hits in that strip. 

To evaluate the quality of the approximation, we turn to simulation. Since 
each letter changes at rate 1 , the naive guess for the waiting time in one strip 
is 4^ /W. If we had L independent copies, this would reduce the time to 
4}^/WL. We will do our simulations for the embedded jump chain so this 
translates into 4}^ /W mutations, which is 682.67 for W = 6 and 8192 for 
W = 8. Multiplying by our estimate of the clump size gives our prediction 
for the mean. In the table below, we compare with E(Ta\Ta > 0) since the 
Poisson clumping heuristic estimates the mean time between occurrences. 
As simulation results given in the next table indicate, our formula is not very 
successful at predicting the mean number of mutations needed to produce 
the event A = {the target word is found somewhere in the L nucleotides} 
when Vl^ = 6, or when W = 8 and the word is AC AC AC AC. On the other 
hand, these results show that the naive estimate of 4^ /W mutations is 
never wrong by more than 20% (see Table 4). 

To understand the reason for this, note that EC is computed under the 
assumption that when the word occurs the letters in adjacent positions are 
random. However, if we are looking at the first occurrence, then we are 
conditioning on the word having not been seen in adjacent strips. 
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Table 4 



Word 


E{Ta\Ta > 0) 


(4^/W)-£C EC 


AACCGT 


717.32 


770.97 


1.129 


ACGCTA 


719.45 


773.65 


1.133 


ACAGCA 


729.49 


785.50 


1.150 


AACGAA 


732.79 


797.97 


1.171 


AACAAC 


746.42 


823.96 


1.210 


ACACAC 


806.85 


900.34 


1.318 


AGAGCTGT 


8674 


8704 


1.0624 


ACAAGGGC 


8685 


8737 


1.0665 


AGAGAGAG 


8722 


8881 


1.0841 


AAAAGAAA 


8825 


8874 


1.0832 


AAGAAGAA 


9013 


9106 


1.1116 


AGAGAGAG 


9584 


10037 


1.2253 



c. Waiting time distribution. Putting together the results from parts a 
and b suggests: 

Approximation 2b. Under P-,^ the waiting time for the target word to 
appear somewhere in one DNA sequence of length L is 

ff^ (l-e~T)<5o + e-^^, whenW = 6, 

when W = 8, 

where 7 = and^, has an exponential distribution with mean / LW)EC . 

We have argued in the first part of this section that the size of the atom 
at is accurately estimated by 1 — e~'^ when W = & and by when W = 
8. Consider first the case W = 8. Since the waiting time in one strip is 
approximately exponential with mean 4}^ /W and the waiting time for the 
segment of length L is roughly /LW, if the time T in the definition of 
Ya is of this order of magnitude, 

Pa = P{Y^ = l)^TW/4:^, 

where is the density of the exponential at 0. Taking into account the 

correction for clumping, 

X = Lpa^ Lp^/EC = TLW/{A^EC). 

When = 6 this reasoning can be applied at positive times. 

To check the prediction of an exponential distribution of positive values. 
Figures 1 and 2 show simulation results for waiting times for AACCGT and 
ACACAC. In each case, we group hitting times into bins of size 100 and 
plot the logarithm of the number of observations versus the hitting time 
count. 
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2.3. W nucleotides in a population. Consider a population of diploid 
individuals. Following a standard practice in mathematical population ge- 
netics, we will formulate the dynamics of the Moran model as if there 
were 2N haploid individuals. Each individual, a string of W letters from 
{A, C, G, T}, is replaced at rate 1, that is, individual x lives for an exponen- 
tially distributed amount of time (with mean 1) and then is replaced. To 
replace individual x, we choose at random from the whole set of individu- 
als, including x itself, and make a copy. With probability Wfi, we randomly 
change one of the letters, and the new individual joins the population, re- 
placing X. 

We assume that, initially, everyone in the population has the same ran- 
domly chosen W letter DNA word. To explain the reason for this assumption, 
we note that a standard result of population genetics implies that if we draw 
the genealogical tree tracing the entire population back to the most recent 

AACCGT, TOOK simulalions 

10 1 T 1 1 1 1 1 r 1 " 




llme/100 

Fig. 1. Waiting time for target word AACCGT in 100,000 replications of the fixation 
chain simulation. Crouping hitting times into bins of size 100, we plot the logarithm of 
the number of observations versus the hitting time count. Despite the bad total variation 
bound, we see a good fit of the exponential distribution. 
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ACACAC. 100K simufations 
lOi i 1 1 1 1 1 — 




10 20 30 40 50 60 70 60 90 100 

lima/I OO 

Fig. 2. Waiting time for target word ACACAC in 100,000 replications of the fixation 
chain simulation. Grouping hitting times into bins of size 100, we plot the logarithm of 
the number of observations versus the hitting time count. Despite the bad total variation 
bound, we see a good fit of the exponential distribution. 

common ancestor, the expected total time in the tree is 

2N -j^ 2N-1 

(7) 2iVE^-7fc7=4^ E l/i~4A^ln(2A^). 

k=2 (2) j=l 

The factor of 2N is the time scahng needed for a population of N diploids to 
end up with the coalescent in which k lineages coalesce to A; — 1 at rate (2) • 
For a more detailed explanation of this and the other facts from population 
genetics we make use of, the reader should consult Durrett [8], Ewens [11] 
or Tavare [21]. 

From (7) we see that the expected number of mutations on the tree is 
Wfi ■ 4:Nln{2N). For reasons that will become clear in the next subsection, 
we will concentrate here on the case W = 8. Taking jj, = 10~® and W = 
8, the result is 0.0316. In words, when N = 10^, our value for the human 
population, 96.8% of the time there is no variation in the population. 
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Let Xt{i) G {A,C,G,T} be the state of individual i at time t. Let F = 
{t>0: Xt{i) = Xt{l) for all i} be the set of times when all individuals have 
the same W letter word. Here F is for fixation, which is the genetics term 
for one word being present in all members of the population. Let 7^ be the 
time of the nth fixation: 

T„ = m{{t > Tn-i -.teF, Xtil) + Xr^_, (1)}, 

where Tq = 0. Let = X-j-„{^)- is our discrete time fixation chain, which 
gives the state of the word in the population after the nth fixation. Let 
Ln G {0,1,..., VF} be the number of letters in y„ that match the target 
word, and let = inf {n : L„ = A:} be the first time that L„ hits state k. Let 

2N^l/9W _ m'^n/9W 
^~ 1/2N + 2Nfi/9W ~ l + AN^n/9W' 

In terms of the fixation chain, we can state: 

Approximation 3. Let ^i,^2,-- - G {0,1} be independent with = 
1) = p. Let S = inf{n : Ln = W — 1, or Ln = W — 2 and ^„ = 1}. The expected 
time to find the target word in W nucleotides in a population of size N = 10^ 
is ^E^S/{Wfi). 

Using information on expected hitting times from the Appendix, we find 

W Q 8 
E^S 214 2300. 

Since /x = 10~^, these numbers translate into huge waiting times: 3.567 x 10^ 
and 2.875 x 10^*^ generations, or 89.2 and 719 billion years respectively (using 
25 years as the human generation time). This shows that it is important that 
regulatory sequences can occur in some region rather than at a fixed location. 

Remark. As calculations below will show, Approximation 3 becomes 
exact if ^ oo with A^/i 1 and is not accurate unless N^fi^ is small. 
Thus, it is not valid for Drosophila populations with effective population 
size N = 10^ and mutation rate fi = 10~^. 

Explanation. A new mutant is introduced into the population accord- 
ing to a Poisson process with rate 2A • Wfj,, and it goes to fixation with 
probability 1/2 A. Thus, fixations occur at rate Wfi and we will need of or- 
der 2A mutations before one reaches fixation. Since any W letter word has 
only 3W one mutation neighbors, we can see that soon after Ln = W —1 we 
will have a mutation that gives us the target word. 
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The next step is to consider the possibihty that the target word will 
be reached when Ln = W — 2. Once a mutation occurs introducing a new 
word, the number of individuals with the mutant word evolves according 
to the Moran model. In Section 4 we will describe the dynamics of this 
model in detail and show that (for N = 10'^ and ^ = 10~^), conditional on 
the mutation dying out, the probability of a second mutation before the 
first one dies out is NW = 8 x 10~^. Taking into account that in order to 
increase L„ from W — 2 to W, we need one of two possible mutations at the 
start and the right second mutation, the probability is 

(8) -^■NW,.^='-^. 

Since fixation has probability 1/2N, the probability of reaching the target 
word before the next fixation is p. To rule out success when Ln = W — 3, 
we use the reasoning for (8) to conclude that when W = 8, the expected 
number of good triple mutations (i.e., ones that will increase Ln = W — 3 
to W) before we find the target word is 4.44 x 10~^ (see Section 4 for more 
details) . 

At this point we have shown that if we measure time in units of the 
discrete time fixation chain, then Approximation 3 gives a very accurate 
approximation of the waiting time. To return to continuous time, we scale 
up by multiplying by the mean waiting time between fixations, and recall 
that, for any random variable, the mean of the sum is the sum of the means. 

2.4. A segment in a population. We consider a population of N diploid 
individuals, following the dynamics of the Moran model, but this time with 
a string of L letters from {A, C, G, T} indexed by the integers mod L to 
avoid boundary effects. As in Section 2.2, we begin with: 

a. Almost matches in a random DNA sequence of length L. Pick a se- 
quence at random from {A, C, G, T}^ . In one window of width 8 the proba- 
bility of matching 7 out of 8 letters is 8(3/4)(l/4)'^, so the expected number 
of almost matches in L = 1024 = 4^ nucleotides is 3/8. Letting Mi be the 
number of "matches minus i" in the initial condition, that is, words that 
disagree with the target word in i letters, it is easy to see that 

W 6 8 
EMi 4.5 0.375 
EM2 33.75 3.9375. 



Approximation 4a. The number of matches minus i in the initial con- 
figuration is approximately Poisson with mean EMi, for i = 1,2 as given in 
the table. 
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In principle, one can use the methods in Section 3 to bound the error in 
these Poisson approximations, but the details get very messy and the bound 
depends in a complicated way on how the word overlaps with its shifts. For 
the 384 best words, the total variation distance is <0.01, and in more than 
75% of cases the distance is <0.1, but in other cases one would need to use 
the Poisson clumping heuristic to get a good approximation. 

b. Quick success in a population. The previous result was for one se- 
quence, so to extend this to the population level, we use: 

Lemma 1. When W = 6 the number of match minus Vs in the popula- 
tion at time is k,2NM , where M has a Poisson distribution with mean 
4.5. 

Intuitively, the correlations between the various sequences are so strong 
that the number of copies in the population is with high probability roughly 
2N times the number, M, in the most recent common ancestor of the pop- 
ulation. We will give details in Section 5. A corollary of this computation 
is that if we define the population consensus sequence to be the most com- 
mon nucleotide at each position, then with high probability the population 
consensus coincides with the sequence of the most recent common ancestor. 

Consider first the situation when M = 1, so that there are roughly 2N 
match minus I's in the population. When N = 10^ and /i = 10~®, the rate for 
getting the right mutation at the desired location is 2Nfi/3 = (2/3) x 10~^, 
so the expected time is roughly exponential with mean 15,000 generations 
or 375,000 years. If M = k, then the waiting time is the minimum of k 
independent copies of the waiting time when M = 1, so the waiting time is 
divided by k. When W = 6 the number of copies is roughly Poisson with 
mean 4.5, so the probability of at least 1 is 0.988. Ignoring the zero cell, the 
expected waiting time is approximately 

oo 4 5^=^ 1 
(9) 375,000 • e-^-^— • - = 107,697 years. 

k=i ^ 

Thus, words of length 6 have an average waiting time of about 100,000 years, 
as quoted in the abstract. Note that the waiting time is not exponential, but 
is a mixture of exponentials. 

c. When there is no almost match. Having taken care of the case W = Q, 
we now focus on = 8. Approximation 3 says that waiting for the word 
to appear at a specified W nucleotides in a population of size is almost 
the same as waiting for the death of a randomly killed version of the fixa- 
tion chain. Thus, to estimate the waiting time where there is not a match 
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minus 1 in the initial condition, we consider the fixation chain [defined as 
in Section 2.3, but this time for Xt{i) € {A,C,G,T}^] that jumps when the 
population is fixed for a new nucleotide at some position. This chain jumps 
at rate L/x, so the probabilities of reaching the target word before the next 
fixation change from the values we have computed previously. When there is 
only one mismatch in a window, the target word appears in some individual 
at rate 2Nfi/3, so the probability this occurs before the next fixation is 

^ 2Nfi/3 ^ 3L/2Af)-i = 20/23 

when L = 1000 and N = 10,000. 

Now consider having two mismatches in a window. Since mutations occur at 
rate 2Nii per nucleotide, those that fix one of the two wrong letters occur 
at rate 2Nfi -2/3. The probability of correcting the other letter before the 
mutation dies out is •1/3, so the probability the target word appears in 
some individual before the next fixation is 

4(iV^)V9 r 

P2 



4(iV/i)2/9 + L/z r + 1' 
where 

47V V 

r = . 

9L 

When L = 1000 and N = 10,000, p2 = 4/9000. 

/9i ~ 1 and in most cases the next few fixations will not affect the word 
with seven matches, so we will stop with probability 1 when a match minus 
1 is found. Let be the death time of the killed chain in which each 
match minus 2 (even those that overlap) independently results in the end of 
the process with probability p2 = 4/9000. The choice in parentheses makes 
programming the simulation easier, but leads to a slight underestimate of 
the actual ending time. 



Approximation 4b. The expected time to find the target word in a pop- 
ulation of size N = 10^ when each individual has L letters is « Et^To / (Lfi) . 

To evaluate Ej^Tu, we turn to simulation. The following results in Table 5 
are based on 100,000 replications of the killed fixation chain. 

To interpret Table 5, recall that the predicted probability of at least one 
match minus 1 in the initial configuration of the L letter fixation chain 
is 1-6-3/^ = 0.3127, which is 5/16 = 0.3125. Thus, there is no match 
minus 1 roughly 11/16's of the time. The smallest mean in the table is 
about 260. When L = 1000 and p = 10~^, new fixations happen at rate 
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Word 


P„ {Td = 0) 


-B„(Td|Td > 0) 


ACAGCTGT 


0.3199 


259.95 


ACAAGGGC 


0.3225 


260.51 


AGAGAGAG 


0.3174 


275.54 


AAAAGAAA 


0.3116 


297.62 


AAGAAGAA 


0.3030 


293.93 


AGAGAGAG 


0.2744 


329.70 



L/i = 10~^, so this translates into 2.6 x 10^ generations or 650 million years. 
When there is exactly one match minus 1 in the initial condition, the waiting 
time is exponential with mean 375,000 years. Hence, the waiting time for 
words of length 8 follows the mixed distribution quoted in the abstract. 

Figures 3 and 4 show simulation results for waiting times for ACAGCTGT 
and AGAGAGAG when there is no match minus 1 in the initial configura- 
tion, with hitting times grouped into bins of size 10. The simulation shows an 
exponential distribution for (TdITd > 0) under Pj^, even though we do not 
have a good reason why the waiting times should have the lack of memory 
property. For the readers who are disappointed to see this final result done 
by simulation, we would like to note that if instead of considering the killed 
fixation chain, we did a simulation of the Moran model for 10,000 diploids 
for 2.6 X 10'' generations, this would take 5.2 x 10^^ simulation steps. Thus, 
even one simulation would be time consuming and 100,000 unthinkable. 

2.5. Conclusions. The calculations in Section 2.4 give the answer to 
Stone and Wray's [20] question: How long do we have to wait for an ex- 
act match to a given W letter word to appear in a segment of length L in 
some individual in a population of N diploids? When W = 6 the mean wait- 
ing time is about 100,000 years, which easily allows differences to accumulate 
in the 6 million years between the divergence of humans and chimpanzees. 
For W = 8, we note that, in combination with Lemma 1, our results have 
shown that the mean waiting time is about 375,000 years if there is a match 
minus 1 in the population consensus sequence, but 650,000,000 years if there 
is not. 

The second answer says that unless you are lucky enough to have a match 
minus 1 in the population consensus sequence, regulatory sequence changes 
can take an extremely long time. However, waiting for exact matches is not 
the right question for the evolution of regulatory sequences. The DNA se- 
quence does not have to be exactly right for transcription factor binding 
to occur. The binding energy between a transcription factor and its DNA 
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ACAGCTGT. 100K simulations of L tetter fixation 




SO too 150 200 250 300 



time/10 

Fig. 3. Waiting time for target word ACAGCTGT in 100,000 replications of the L letter 
fixation chain. We group hitting times into bins of size 10 and plot the logarithm of the 
number of observations versus the hitting time count. Again, we see a good fit of the 
exponential distribution. 

binding site is, to a good approximation, the sum of independent contri- 
butions from a small number of important positions in the binding site 
sequence Yli^i] see [12]. In a commonly used approximation called the two 
state model £i G {0,e} (see [5]), the binding energy is determined by the 
number of mismatches r between the two strings. The binding probability 
is commonly modeled by the Fermi function 

1 

^ 1 + exp(e(r - ro)) ' 

where tq is the threshold for a binding probability of 1/2, and contrary to 
the usual mathematical usage e ~ 2. 

Based on this analysis, it seems reasonable to simplify the biological com- 
plexities of binding of transcription factors by saying that a match minus 
1 is adequate. If we do this, then it follows from the analysis above that it 
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Fig. 4. Waiting time for target word ACACACAC in 100,000 replications of the L letter 
fixation chain. We group hitting times into bins of size 10 and plot the logarithm of the 
number of observations versus the hitting time count. Again, we see a good fit of the 
exponential distribution. 



is enough to have a match minus 2 in the fixation chain. From Approxima- 
tion 4a, the mean number of these in the initial distribution is 33.75 when 
W = 6, and 3.94 when 1^ = 8, so the very long waiting times are unlikely. 
In the case of an 8 letter word, generalizing (9) shows that the waiting time 
has mean 



375,000 



(10) 



k=l 



3.94 



3.94'^ 



1 

k 



■ 61,560 years. 



Here, we have divided by 2 since there are two sites where a mutation can 
upgrade a 6 out of 8 match to 7 out of 8. 

Having announced our results, we should mention two things, each of 
which could change the answer by a factor of 100, but in different directions. 
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First, we have assumed that once the target word occurs in an individual, it 
sweeps to fixation in the population, that is, not long after introduction the 
frequency will rise to 1. In reality the probability of fixation is approximately 
the selective advantage conferred by the mutation s and even for strongly 
beneficial mutations we have s < 0.01. This means that the mutation would 
need to arise more than 100 times in order to achieve fixation, which would 
increase the waiting time to 6 million years. In the other direction, our study 
has focused on changes in the regulation of one particular gene, but there 
are more than 20,000 genes in humans and chimpanzees, and changes in 
even 1% of these genes could be enough to explain the observed differences. 

We have pursued an approach in which mutations are neutral, that is, 
they do not change the fitness of the individual. Several researchers have 
formulated models for the evolution of regulatory sequences, making use of 
models of the binding of regulatory proteins to DNA, similar to the ones 
described above, and assuming the fitness is proportional to the binding 
probability. Gerland and Hwa [13] assumed an infinite population and de- 
veloped a theory based on the quasispecies approach of [10]. MacArthur and 
Brookfield [17] and Berg, William and Lassig [4] have developed related mod- 
els that incorporate genetic drift. These models use different mechanisms, 
but also conclude that the existence of presites (which can be converted into 
binding sites by one mutation, i.e., our match minus I's) largely determine 
where binding sites will evolve. 

The results in this paper depend heavily on the fact that the popula- 
tion size = 10^. When A^ ~ 10^ the estimates in Section 2.3 show that 
we cannot ignore the possibility of triple mutations between fixations. For 
A^ = 10^ which is appropriate for Drosophila, the collection of sequences in 
the population is even more fluid. Experimental results for the even-skipped 
stripe 2 enhancer (see [16] and references therein) suggest that other evo- 
lutionary mechanisms are at work in this case. Carter and Wagner [6] have 
shown that in this case binding domains can shift through a combination of 
a mildly deleterious mutation followed by subsequent selection for a compen- 
satory mutation. Prudhomme et al. [18] have shown recently that regulatory 
changes have caused two independent gains and five losses of wing pigmen- 
tation spots in the Drosophila melanogaster species group. Understanding 
the mechanisms at work in the evolution of gene regulation when A^ = 10^ 
or for larger population sizes appropriate for bacterial or viral genomes is 
an interesting open problem. 

3. Chen-Stein calculations. To fill in the missing details in our treatment 
of a segment of length L in one DNA sequence (Section 2.2), the first detail 
is to consider the following: 
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A T G C A T 
ATGCAT 



A, 



Fig. 5. 



3.1. Success in the initial condition. For a,l3 £ I and A^^Ap defined as 
in Section 2.2, when j3 = a + k for 1 <k <W — 1, Aa and Ajs overlap in 
W — k letters. Figure 5 shows an example with W = 6 and k = 4. 

Let = 1 if the word and the shifted word match exactly on the overlap, 
otherwise. It is easy to see that i?[yaYg] = yk4:~^^~^''\ so 

a /3G_B„\{a} 

W-1 

= ^4-^.2 5] y,4-^- 

a k=l 

W-l 

= 27Q0 where 7 = 14'^ and Qo = E y^^'''- 

k=l 

Qo is the expected number of overlapping matches when shifting to the right 
conditioned on finding a match in the first W nucleotides. The 2 comes from 
the symmetric case of shifting left as well as right. From this formula, it is 
straightforward to calculate the values in Table 2 in Section 2. 2. a. 

3.2. Number of occurrences by time T. Let Pa/^ = E[YaY/^]. We estimate 

Pal3 = Pa=l3 + Pa<l3 + Pf3<a-, 

where Pa=i3 refers to hitting the target word at the same time in two over- 
lapping strips Aa and Ap, while Pa<^p (which is equivalent to the case pp^a) 
refers to hitting the target word in A^ strictly before Ap. 

Consider the target word and its shift by k letters, as in the picture above 
the definition of yk- Define 

= # of matching letters inside the overlap oiW — k letters. 

Consider the moment that we first get a match in A^ and let 

Rk = # of letters matching the target word in Ap \ A^. 

Since we start from a random initial sequence, Rk is Binomial(/c, 1/4). 
We start with Pa=i3- Let = 1 if = W — k, and otherwise 

w-i 

E Pa=/3 = '2paQ, where Q = ^ 
/3gBc\{q} k=l 
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where {W — k) /W is the probabihty the last mutation was in the overlap, a 
necessary condition for hits in and Ajs to occur at the same time. From 
this, it follows that 

(11) Yl E Pa=p = 2\Q. 

Turning to the case Pa<p = Pp<a-, let r{k,j) = P{Rk = j), let h{x) = Px{T\y < 
To) and 

k 

z{mk,k) =Yr{k,j)h{mk + j)l{mk+j<W}- 
j=0 

This is the probability of hitting the target word in A/^ before the number of 
matches in Af^ returns to 0, when there are matches inside the overlap. 
We eliminate matches at time since they are counted in Pa=i3 ■ Since after 
the return to there are <T units of time left, Pa+k is an upper bound on 
the probability of a match after the return to matches and before time T, 

w-i 

E Pa<l3 <'^Pa Y[zimk,k) +Pa+k]- 
l3eB^\{a} k=l 

Letting S = J2h^^ ^i^k,k) and noting that 2J2aPaJ2k^=i^ Pa+k is essen- 
tially bi except we get 2W — 2 instead of 2W — 1, we have 



(12) E [Pa<p+Pp<o.]<2(^^^^—^ + 2XS 



Combining (11) and (12), we have 

62 < {4W - 4)/L + X{2Q + 45) . 
Using this, we can compute the values in Table 3. 

3.3. Clump size. Our next topic is to explain our approximation for the 
expected clump size: 

(13) i,c.i±M±«. 

1 — a 

Intuitively, the numerator gives the expected number of strips in which we 
get a hit before the number of matching letters in that strip returns to 
0. Prom the definition in the previous subsection, we see that 2Q takes 
care of the hits that occur at the same time, 25" the ones that come later. 
Multiplying by 1/(1 — a) accounts for occurrences after the first one in a 
strip. 
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4. Population dynamics. Here we are concerned with the derivation of 
Approximation 3. In that setting we have a population of N diploid indi- 
viduals, each of which has a W letter DNA word. Suppose that at time all 
individuals in the population have the same W letter word except for one 
mutant that differs in one letter, and for the moment ignore the possibility 
of additional mutations. Let Zt = the number of "mutants" in the popula- 
tion at time t, that is, the number of individuals with the mutated word. 
follows the dynamics of the Moran model. That is, when Zt = k, 

I. replace a mutant with a mutant at rate k ■ 
II. replace a mutant with a nonmutant at rate k ■ 

III. replace a nonmutant with a mutant at rate {2N — k)--^, 

IV. replace a nonmutant with a nonmutant at rate (2A^ — k) ■ "^^^^ ■ 

Let Tfc = mi{t >0:Zt = k} be the hitting time of k. We stop the process 
when Zt hits (the mutation dies out) or 2A^ (the mutation becomes fixed 
in the population) and call the result an "excursion." Our first goal in this 
section is to compute the expected value of -B = the total number of mutant 
births in one excursion, that is, the number of times I or III occurs before 
the process enters one of the absorbing states or 2N. Writing ^ bjy to 
indicate aj\[/bN ^ as — > oo, we have 

Lemma 2. Ei{B\To <T2n) N . 

Proof. To prove this claim, we begin by noting that up and down 
jumps occur at the same rate, so the embedded jump chain (which gives the 
sequence of states visited by the continuous time chain) is a simple random 
walk Sn with 5o = 1. 

Simple random walk. Again let = inf{n > : Sn = k} be the hitting 
time of k. Let A^ = number of visits to state k for Sn before Tq. Our first 
step is to compute 



where = inf{n > I : Sn = k} is the return time. To explain the formula, 
the numerator gives the probability the mutants achieve a population of size 
k. Once this occurs, the simple random walk has a geometric number of visits 
to k with success probability Pk{T^ > Tq\Tq < T2n)- Using the definition of 
conditional probability and the Markov property. 



^i(Afe|ro<r2^) 



Pi(Tfc<To|To<r2jv) 

Pfc(r+>ro|To<T2jv) 



Pi(rfc<ro|ro<T2jv) 



Pi(rfc<To,To<T2jv) 
Pi{To<T2n) 

PiiTk<To)Pk{To<T2N) 
Pi{To<T2n) 
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(14) 

(l/fc)(l-fc/(2iV)) 
1 - 1/{2N) 
2N-k 



k{2N -1)' 

The next to last equality comes from the fact that Sn is a martingale and, 
hence, Pi{Tj < Tq) = i/j. Tm'ning to the denominator of Ei{Nk\TQ < T2n), 

PUT, >To|ro<r,^)- p^^j^^^ 

(15) 

(1/2) . (l/k) 2N 



l-k/{2N) 2k{2N-k)' 

The 1/2 comes from the fact that 5„ has to jump from A; to /c — 1 in order 
to be able to hit before going back to k. Combining our results gives 

/ 2N-k \( 2k{2N-k) \ 2{2N-kf 
(16) i^i(iV.|ro<r2^) = (^^^^^j(^ ^AT j = 2iV(2iV-l)- 

Moran model. Next we want to compute the mean number of mutant 
births while the Moran model is in state k. Let be the number of type I 
events before a jump makes the chain leave k. Let Jj, be the number of type 
i jumps while the chain is in state k: 

Eii4\To < T2n) = EiNkin < T2n) ■ EWk. 

Wk has a shifted geometric distribution with mean EW, = 1/P(jump) — 1, 
where 

P(jump) = 2M2iV-fe)/(2iV) ^ m-2k 

^' kk/{2N)+2k{2N -k)/{2N) AN - k ' 

which implies EW, = k/{4:N — 2k). Using (16) and summing, we have 

'^^^,,1,^ ^ , 2{2N-kf k 

X. E,{J, |To < T2n) = 2iV(2Ar - 1) ' 

~ 5i ~ 2iVy 2iV - 1 2N 

~ 2N /V - x)xdx = A^/3. 

To compute the expected number of type III jumps in state /c, we begin 
by noting that the probability we ever visit k is given by (14). Using the 



2^-1 ^ k\ k 1 
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calculation for (15), 

Pfc(no up jump from k before ToITq < T2n) 

(17) 

_ 1/(A; + 1) _ 2N 



l-k/{2N) {k + l){2N-ky 

Since the number of up jumps from k for the conditioned chain has a shifted 
geometric distribution, using (14) gives 

^, o _ , 2N-k f{k + l){2N-k) 
i?i(4|To<r2^) = ^^2^_^^(^ ^ 1 

If k and 2N — k are large, then k{2N — k)/2N is much bigger than 1 and 
the —1 can be ignored. Summing, we have 



iV fcl u 2NJ ^ 2N{2N-\) 2N 



'2N [\l-xfdx = 2N/3. 
Jo 



Adding this to the expected number of jumps of type I gives the desired 
result. □ 

We next consider the final excursion in which the new mutant fixes. 
Lemma 3. Ei{B\T2n < To) ~ 2N'^. 

Proof. We use the notation of the previous proof. Here, Pi (T^ < To\T2n < 
To) = 1, so 

Er{N,\T2. < To) = p^^^,- ^ tL\T2j, < nY 

The denominator 

{l/2)Pk+i{T2N<Tk) (1/2) • l/(2iV - A:) 2N 



PkiT2N<To) k/{2N) 2k{2N-ky 

since for simple random walk Pk+i{T2N < T^) = Pi{T2N-k < To)- To com- 
pute the expected number of jumps of type I in state k, we note that 

E^iJl\T2M < To) = EiN,\T2r, < To) • EW, = ^^^^^ • = ^. 

Summing, we have 

2N-1 

E i=^i(Tfc|T2^<To)~(2iV)V3. 

k=l 



26 



R. DURRETT AND D. SCHMIDT 



To compute the expected number of jumps of type III in state k, we begin 
by using the computation in (17) to conclude 

Pfc+i(no down jump from k + 1 before r2Ar|r2Ar < Tq) 

= Pk+l{Tk>T2N\T2N <To) 

2N 



P2N-k-l{T2N~k >Tq\To < T2N) 



{k + l){2N -k)' 

Thus, the expected number of down jumps from k+1 (which come from 
type II changes): 

^l[Jk+l\'^2N < Jo) - ^ 1- 

Since on {T2N < Tq} this is one less than the expected number of up jumps 
from k, 

3 (fc + l)(2jV-fc) 

^l{'Jk\-^2N < J^O) - ^ • 

Summing, we have 

27V-1 1 

J2 Ei{Jl\T2N <n) ^ {2Nf x{l-x)dx = {2Nf/Q. 
k=i 

Adding this to the expected number of jumps of type I gives the desired 
result. □ 

Derivation of Approximation 3. As explained in Section 2.3, fixations 
occur at rate Wfi and soon after Ln = W — 1 we will have a mutation that 
gives us the target word. 

The next step is to consider the possibility that the target word will be 
reached when the fixation chain Ln = W — 2. Lemma 2 shows that if B is 
the number of times a new mutant is born in an excursion, then 

Ei{B\To<T2n)^ N. 

Given this result, if Ln = W — 2, the expected number of times that we will 
have a correct mutation at the start and the right second mutation before 
the first one dies out is 

2 1 2Nn 
N-Wn = -. 

3W ^ 3W 9W 

The first factor gives the fraction of the 3W possible mutations that fix one 
of the mismatches, the second, the expected number of births per excursion 
that can result in mutation (by Lemma 2), the third, the mutation proba- 
bility and the fourth factor the fraction of the 3W possible mutations that 
fix the one remaining mismatch. 
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Since fixation has probability 1/2N, thinking about a race between two 
rare events shows that the probabihty of reaching the target word before 
the next fixation is approximately 

2Nfi/9W 



l/2N + 2Nfi/9W l + m^n/9W' 
To verify this, we note that the probability of not reaching the target word 



IS 



»i:a/-.(i-f^-^)" 

l/(2iV) 



2Af^/(9VF) + l/(2Af) l + 4iVV/9VF' 

An important consequence of this calculation is that we can ignore the 
interaction between mutations. When = 8 and \i = 10~®, the probability 
that between two successive fixations a second mutation arises before the 
previous one dies out is 1/19. The probability this mutation reaches size K 
in the population is 1/i^, so with high probability none will reach a large 
size. There is also the issue of second mutations while a successful mutation 
is reaching fixation. As Lemma 3 shows, 

Ex{B\T2N <Tq)^2N'^. 

Thus, the expected number of second mutations during this excursion is 
2Af2^ = 2 (when iV = 10^ and [i = W~^), but with high probability none will 
reach a large size. We ignore the possibility of finding the target word in the 
final excursion when Ln = W — 2, since this can happen only if Ln+i = W — 1, 
in which case the killed fixation chain will terminate at time n+ 1. 

To rule out success when Ln = W — 3, we extend the reasoning used for 
Ln = W — 2 to conclude that the expected number of good triple mutations 
(i.e., ones that will increase Ln = W — 3 to W, bringing us to the target 
word) between two fixations is 

2N.^.N.W,.^.N.W,. ' 



3W 3W " 3W 9W 

The first factor is the expected number of excursions between two fixations, 
the second the fraction of the 3W possible mutations that fix one of the 
mismatches, the third the expected number of births per excursion that 
can result in mutation, the fourth the mutation probability, the fifth the 
fraction of the 3W possible mutations that fix one of the two remaining 
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mismatches and so on. When W = 8, using the fact estabhshed in (25) that 
E{N5{t7)) = 80, we see that the probabihty of a good triple mutation before 
we find the target word is 4.44 x 10"'^. 

5. Proof of Lemma 1. The calculation for (7) has shown that the ex- 
pected number of mutations in a segment of length L in a population of N 
diploids is Lfi ■ 4iVln(2iV) = 3.95 when L = 1000, N = 10,000 and n = lO'^. 
Since 4.5 windows of length 6 contain 27 nucleotides, this implies that the 
expected number of match minus I's that are disrupted by mutation is 
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3.95 
1000 



0.1066. 



In the other direction, since there are an average of 33.75 match minus 
2's and each has only two sites that can be fixed by one of three possible 
mutations, the number of match minus I's created is 



33.75-2-i-^=0.0888. 



The last two expected values are not negligible, but mutations when they 
exist are rare. Well-known results about the site frequency spectrum imply 
that the expected number of individuals with the mutant nucleotide at a 
site given that a mutation has occurred is 



2N -I 

y — - — 



2N 



ln(2iV) 



:0.1009(2Af) 



when = 10,000. 



APPENDIX: COMPUTATIONS FOR THE MUTATION CHAIN 

Our first task is to explain the computation of the mixing time T2 that 
appears in the derivation of Approximation 1. Let 



U: 



/1/4 


1/4 


1/4 


l/4\ 


1/4 


1/4 


1/4 


1/4 


1/4 


1/4 


1/4 


1/4 


Vl/4 


1/4 


1/4 


1/4/ 



V 



1/3 



1/3 


1/3 


l/3\ 





1/3 


1/3 


1/3 





1/3 


1/3 


1/3 


; 



The eigenvalues of U are 1, 0, 0, 0, since any vector x orthogonal to the 
constant vector has Ux = 0. To compute the eigenvalues of V, we note that 
V = (4/3)f/ — (1/3)/, so its eigenvalues are 1, —1/3, —1/3, —1/3 and the 
spectral gap is 4/3. If we consider the discrete time chain in which there 
are W letters and one changes at each step, the spectral gap is 4/3VF or 
T2 = 3W/4:. Of course, if we speed up the chain so that each coordinate 
jumps at rate 1, then r2 = 3/4. 
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Let Xn have state space 5 = {0, 1, ... , W} and transition probabilities 

p{x, X — 1) = x/W, 

(19) p{x,x + l) = {l/3){W -x)/W, 

p{x,x) = {2/3){W -x)/W, 

where all other p{x, y) = 0. In this section we will use standard Markov chain 
techniques to compute several quantities of interest for this Markov chain. 

Let h{x) = Px{Ty/ < To) be the probability of hitting W before hitting 
state when starting in state x. h{x) satisfies 

(20) h{x) =p{x,x + l)h{x + 1) +p{x,x — l)h{x — 1) + p{x,x)h{x) 

for < x < W, with boundary conditions h{W) = 1 and h{0) = 0. We solve 
for h{x) recursively. Since 1 — p{x, x) = p{x, x + 1) + p{x, x — 1), rearranging 
gives 

p{x,x — l)[h{x) — h{x — 1)] =p{x,x + l)[h{x + 1) — h{x)], 
which implies 

'p{x, X + 1) 



h{x)-h{x-l) 



(21) 



p{x, X — 1) 
W-x 



[h{x + 1) - h{x)] 



3x 



[h{x + 1) - h{x)]. 



Setting h{W) — h{W — 1) = C and using the recursion in (21) gives 

1 



h{W -l)-h{W -2) 
h{W - 2) - h{W - 3) 



3{W-l) 
2 

3{W -2) 



1 



3{W -I) 



h{W -k)-k{W -k-l) 
Simplifying, we have 
(22) h{W -k)-h{W -k-l) = 



l---k 



3fc(Vl7-i)...(VF-fc) 



C. 



k\{W-k-l)\ 



3^(VF- 1)! 



C 



c 



To solve for C now, we use a telescoping series to conclude 



I = h{W) - h{0) = C 



E 



L k= 



1 

3^(V)J 
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Table 6 



X 


W = Q 


W = 8 


1 


0.003782 


0.0004334 


2 


0.006051 


0.0006190 


3 


0.009455 


0.0008047 


4 


0.01966 


0.001139 


5 


0.08093 


0.002141 


6 




0.007156 


7 




0.05228 



Using the last two formulas, we can compute h[x) for each < x < W , 
given W . Here and in what follows numerical results were obtained by using 
C and/or Matlab programs. Our Table 6 gives the values of h{x). 

Since Pw{T^ < Tq) = P\y-i{Tw < Tq) = h{W — 1), this gives the values 
of a quoted in Section 2.2. 

(20) implies that h{Xn) is a martingale so if a < x < 6, then 

(23) P.(r„<r,)-^^^^_^^^^, P^(T^>Tt)-j^^^y-^. 

From this we can compute Green's function Gb{x,y) = Ex{Ny{Tb)) = the 
expected number of visits to y starting from x before hitting b. If x,y < b, 
then 

(nA\ p (AT (rp\\ _ PxjTy < Tb) P'xjTy < Tb) 

Py{T^ > Tb) p[y, y + l)Py+i [Ty > Tb) 

In words, the numerator gives the probability that we get to y before hitting 
b. If we reach y, then we will return a geometric number of times with 
"success" probability Py{T^ > Tb). In order to hit b before returning to y, 
we must go to y + 1 on the first jump and then hit b before y. Some concrete 
examples that we will need are 

(25) '^'^'^ 

F (^r(r)) 1 p(6,5)+p(6,7) 

- pi5,min>T,) - p(5,6)p(6,7) - 

Our next goal is to compute expected hitting times. We could do this 
by summing the Green's function: ExTb = J2y<b^b{x,y). However, we will 
also need results for conditioned chains, so we will do this by solving equa- 
tions. If r{x,y) is any irreducible nearest neighbor transition probability on 
{0, 1, . . . , W}, then u{x) = E^Tb satisfies 

u{x) = 1 + r{x, X + l)u(x + 1) + r{x, x)u{x) + r(x, x — l)u{x — 1) 
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X 


y = 8 


7 


6 


5 


4 


3 


2 


1 





69104.23 


3569.23 


449.66 


104.37 


36.91 


16.43 


7.71 


3.00 


1 


69101.23 


3566.23 


446.66 


101.37 


33.91 


13.43 


4.71 





2 


69096.51 


3561.51 


441.94 


96.66 


29.20 


8.71 







3 


69087.80 


3552.80 


433.23 


87.94 


20.49 









4 


69067.31 


3532.31 


412.74 


67.46 











5 


68999.86 


3464.86 


345.29 













6 


68654.57 


3119.57 















7 


65535.00 


















for < X < with u{h) = and u(0) = 1 + r(0, O)m(O) + r(0, l)u{l). 
Since transition probabilities must sum to 1, 

r(x, X + l)[u{x) — u{x + 1)] = 1 + r(rE, x — l)[u{x — 1) — u{x)\ 

and it follows that 

, , r / N / M 1 + rix,x — l)\uix — I) — uix)] 

(26) [u{x) - u{x + 1)] = \ ' 

r[x, X + \) 

Using n(0) — u{l) = l/r(0, 1), we can iterate to find the successive differences 
and then use u{h) = to determine the function. This procedure is enough 
for numerical computations so we will not give a formula for u. The values 
of the hitting times E^Ty when W = S are given in Table 7. Note that 
u{x) — u{x + 1) does not depend on y, so the differences between values in 
two successive rows are constant, and are equal to the value sitting above 
the 0. 

We compare EqT^ with E-,^T]y in Section 2.1, where E.,^Tw = ■K{x)ExTyY . 
Next we need to compute hitting times for the chain conditioned to hit 
W before 0. To compute the transition probability, 

q{x, y) = P{Xi = y\XQ = x,Tw < Tq) 

_ P^jXi = y)Py{Tw < To) _ p{x,y)h{y) 
P^{Tw<Tq) h{x) 

One can use (26) for this chain as well. As Table 8 for = 8 shows, this 
dramatically reduces the hitting times from the unconditioned values ExT^ > 
65,535. 

To compute EqS (recall that S is defined in Approximation 3 in Sec- 
tion 2.3), we begin by noting that EqS = Eqtw-2 + Ew-2S. Then by con- 
sidering what happens on the first jump from W — 2, 

Ew-2S = (1 - p)[p{W -2,W-1)- l+p{W -2,W-2)- Ew-2S 

+ p{W -2,W -?,)■ {Ew-^Tw-2 + Ew-2S)] . 



32 



R. DURRETT AND D. SCHMIDT 



Solving, we have 

Ew-2S={l-p) 



p{W -2,W-1)+ pjW -2,W- 3)Ew-3Tw-2 
l-il-p){l-p{W-2,W-l)) 



Table 



X 


E^{Ts\Ts < To) 


E^{To\Ts < To) 


8 





47.229002 


7 


2.156068 


46.229002 


6 


8.152877 


45.138092 


5 


19.963106 


43.696338 


4 


31.794219 


41.811331 


3 


39.459771 


39.184505 


2 


43.829002 


35.059746 


1 


46.229002 


26.937262 





47.229002 
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